2023-09-20 13:41:54, AIbase
Harvard and Columbia Release Open Dataset of 16 Million Protein Multiple Sequence Alignments, Addressing the Proprietary Data Problem in AlphaFold 2 Training
Researchers at Harvard and Columbia have released OpenProteinSet, an open dataset containing 16 million protein multiple sequence alignments (MSAs) and related data for training AI models. AlphaFold 2 marked a milestone in the accuracy of protein structure prediction, but the data used to train it is proprietary, which has limited progress by other researchers. OpenProteinSet includes MSAs for all proteins in the Protein Data Bank (PDB), giving the protein machine learning community a large pool of precomputed MSAs. The dataset can be used for a range of tasks in structural biology.